R Bootcamp 2020

Manipulations

  • Data integrity
  • Reshaping
  • Filtering
  • Merging
  • Summarizing

Packages that we will use

# Make tidyverse load quietly
options(tidyverse.quiet = TRUE)
library(tidyverse)   #< General use
library(tidyr)       #< Reshaping
library(wbstats)     #< World bank data.
library(countrycode) #< Country coding
library(assertthat)  #< Results checking
library(lubridate)   #< Date manipulations
requireNamespace('zoo')#< time series

Loading data

for .RData files use load()

load("data/ebola.data.RData")

for .rds files use readRDS() and capture the results in a variable.

ebola <- readRDS("data/ebola.data.rds")

Always check your data

Any problems with the data?

Ebola Data Problem

The Most obvious is that Country Report Date should be repeated down the rows.

Fix with dplyr::mutate() + zoo::na.locf() (Missing last observation carried forward.)

mutate() variants

  • mutate() - modify/add variables
  • mutate_at() - modify a set of variables.
  • mutate_if() - modify variables meeting a criteria
  • transmute() - create a new set of variables based on previous.

Ebola Data Problem

ebola.data %<>% 
    mutate_at('Country Report Date', zoo::na.locf)

Notes:

  1. We used the assign-back pipe %<>% to modify in place. Generally this is frowned upon
  2. Note the double colon for using na.locf from zoo without attaching the package.

select() - Choosing variables

Use select() to choose the variables desired.

Basic Usage

select(data, ...)

Over the next few examples we will explore the forms ... can take

select() - Variable Names

the easiest is with variable names:

ebola.data %>% select(Country, `Case def.`, `Total cases`) %>% head()
Country Case def. Total cases
Guinea Confirmed 3351
Guinea Probable 453
Guinea Suspected 0
Guinea All 3804
Liberia Confirmed 3151
Liberia Probable 1879

select() - Dropping by Variable Names

You can select everything but a variable with the minus operator

ebola.data %>% select(-`Total cases`) %>% head()
SheetName Country Case def. Total deaths Country Report Date
Jan 06, 2016 Guinea Confirmed 2083 2015-12-27
Jan 06, 2016 Guinea Probable 453 2015-12-27
Jan 06, 2016 Guinea Suspected 0 2015-12-27
Jan 06, 2016 Guinea All 2536 2015-12-27
Jan 06, 2016 Liberia Confirmed 0 2015-05-09
Jan 06, 2016 Liberia Probable 0 2015-05-09

select() - By the numbers

You can select by variable position as well.

ebola.data %>% select(1:4) %>% head()
SheetName Country Case def. Total cases
Jan 06, 2016 Guinea Confirmed 3351
Jan 06, 2016 Guinea Probable 453
Jan 06, 2016 Guinea Suspected 0
Jan 06, 2016 Guinea All 3804
Jan 06, 2016 Liberia Confirmed 3151
Jan 06, 2016 Liberia Probable 1879

select() - by variable range

Use single colon : with variable names to select variables named and eveything in between:

ebola.data %>% select(Country:`Total cases`) %>% head()
Country Case def. Total cases
Guinea Confirmed 3351
Guinea Probable 453
Guinea Suspected 0
Guinea All 3804
Liberia Confirmed 3151
Liberia Probable 1879

select() - by helpers

selection helpers are also provided:

ebola.data %>% select(starts_with("Total")) %>% head()
Total cases Total deaths
3351 2083
453 453
0 0
3804 2536
3151 0
1879 0

select() - The helpers

The available helpers are:

  • starts_with()
  • ends_with()
  • contains() - must match literally
  • matches() - Regular expression match
  • num_range() - numerical ranged variables with a prefix
  • all_of() - must match all of given variables named in a vector.
  • any_of() - select any variables present in given vector, but no error if not present.
  • everything() - Matches all variables, useful when reordering variables.
  • last_col() - The last column

select() - Multiple

You may use multiple forms together

ebola.data %>% select(last_col(), 2:3, `Total cases`) %>% head()
Country Report Date Country Case def. Total cases
2015-12-27 Guinea Confirmed 3351
2015-12-27 Guinea Probable 453
2015-12-27 Guinea Suspected 0
2015-12-27 Guinea All 3804
2015-05-09 Liberia Confirmed 3151
2015-05-09 Liberia Probable 1879

Subsetting data

Subset data with the filter() function.

The base R version is subset, but it is FAR less robust.

It takes the form of

filter(data, expr1, expr2, ...)

where data is the data set, and expr1, expr2, … are the criteria expressions evaluated in the context of the data. Data must meet all ctriteria to remain.

filter() Example

Subset data to only confirmed cases for Nigeria.

filter( ebola.data
      , Country == 'Nigeria'
      , `Case def.` == 'Confirmed'
      )
SheetName Country Case def. Total cases Total deaths Country Report Date
Jan 06, 2016 Nigeria Confirmed 19 7 2014-10-19
Dec 30, 2015 Nigeria Confirmed 19 7 2014-10-19
Dec 23, 2015 Nigeria Confirmed 19 7 2014-10-19
Dec 16 Nigeria Confirmed 19 7 2014-10-19
Dec 9 Nigeria Confirmed 19 7 2014-10-19
Dec 2 Nigeria Confirmed 19 7 2014-10-19
Nov 25 Nigeria Confirmed 19 7 2014-10-19
Nov 18 Nigeria Confirmed 19 7 2014-10-19
Nov 11 Nigeria Confirmed 19 7 2014-10-19
Nov 4 Nigeria Confirmed 19 7 2014-10-19
Oct 28 Nigeria Confirmed 19 7 2014-10-19
Oct 21 Nigeria Confirmed 19 7 2014-10-19
Oct 14 Nigeria Confirmed 19 7 2014-10-19
Oct 7 Nigeria Confirmed 19 7 2014-10-19
Sep 30 Nigeria Confirmed 19 7 2014-10-19
Sep 23 Nigeria Confirmed 19 7 2014-10-19
Sep 16 Nigeria Confirmed 19 7 2014-10-19
Sep 9 Nigeria Confirmed 19 7 2014-10-19
Sep 2 Nigeria Confirmed 19 7 2014-10-19
Aug 26 Nigeria Confirmed 19 7 2014-10-19
Aug 19 Nigeria Confirmed 19 7 2014-10-19
Aug 12 Nigeria Confirmed 19 7 2014-10-19
Aug 5 Nigeria Confirmed 19 7 2014-10-19
July 29 Nigeria Confirmed 19 7 2014-10-19
July 22 Nigeria Confirmed 19 7 2014-10-19
July 15 Nigeria Confirmed 19 7 2014-10-19
July 8 Nigeria Confirmed 19 7 2014-10-19
July 1 Nigeria Confirmed 19 7 2014-10-19
June 24 Nigeria Confirmed 19 7 2014-10-19
June 17 Nigeria Confirmed 19 7 2014-10-19
June 10 Nigeria Confirmed 19 7 2014-10-19
June 3 Nigeria Confirmed 19 7 2014-10-19
May 27 Nigeria Confirmed 19 7 2014-10-19
May 20 Nigeria Confirmed 19 7 2014-10-19
May 13 Nigeria Confirmed 19 7 2014-10-19
May 06 Nigeria Confirmed 19 7 2014-10-19
Apr 29 Nigeria Confirmed 19 7 2014-10-19
Apr 22 Nigeria Confirmed 19 7 2014-10-19
Apr 15 Nigeria Confirmed 19 7 2014-10-19
Apr 08 Nigeria Confirmed 19 7 2014-10-19
Apr 01 Nigeria Confirmed 19 7 2014-10-19
Mar 25 Nigeria Confirmed 19 7 2014-10-19
Mar 18 Nigeria Confirmed 19 7 2014-10-19
Mar 11 Nigeria Confirmed 19 7 2014-10-19
Mar 4 Nigeria Confirmed 19 7 2014-10-19
Feb 25 Nigeria Confirmed 19 7 2014-10-19
Feb 18 Nigeria Confirmed 19 7 2014-10-19
Feb 11 Nigeria Confirmed 19 7 2014-10-19
Feb 04 Nigeria Confirmed 19 7 2014-10-19
Jan 28 Nigeria Confirmed 19 7 2014-10-19
Jan 21 Nigeria Confirmed 19 7 2014-10-19
Jan 14 Nigeria Confirmed 19 7 2014-10-19
Jan 07 Nigeria Confirmed 19 7 2014-10-19
Jan 06 Nigeria Confirmed 19 7 2014-10-19
Jan 05 Nigeria Confirmed 19 7 2014-10-19
Jan 02 Nigeria Confirmed 19 7 2014-10-19
Dec 31 Nigeria Confirmed 19 7 2014-10-19
Dec 30 Nigeria Confirmed 19 7 2014-10-19
Dec 29 Nigeria Confirmed 19 7 2014-10-19
Dec 26 Nigeria Confirmed 19 7 2014-10-19
Dec 24 Nigeria Confirmed 19 7 2014-10-19
Dec 23 Nigeria Confirmed 19 7 2014-10-19
Dec 22 Nigeria Confirmed 19 7 2014-10-19
Dec 19 Nigeria Confirmed 19 7 2014-10-19
Dec 17 Nigeria Confirmed 19 7 2014-10-19
Dec 15 Nigeria Confirmed 19 7 2014-10-19
Dec 10 Nigeria Confirmed 19 7 2014-10-19
Dec 03 Nigeria Confirmed 19 7 2014-10-19
Dec 01 Nigeria Confirmed 19 7 2014-10-19
Nov 26 Nigeria Confirmed 19 7 2014-10-19
Nov 21 Nigeria Confirmed 19 7 2014-10-19
Nov 19 Nigeria Confirmed 19 7 2014-10-19
Nov 14 Nigeria Confirmed 19 7 2014-10-19
Nov 12 Nigeria Confirmed 19 7 2014-10-19
Nov 7 Nigeria Confirmed 19 7 2014-10-19
Nov 5 Nigeria Confirmed 19 7 2014-10-19
Oct 31 Nigeria Confirmed 19 7 2014-10-19
Oct 29 Nigeria Confirmed 19 7 2014-10-19
Oct 25 Nigeria Confirmed 19 7 2014-10-19
Oct 22 Nigeria Confirmed 19 7 2014-10-19
Oct 17 Nigeria Confirmed 19 7 2014-10-14
Oct 15 Nigeria Confirmed 19 7 2014-10-12
Oct 10 Nigeria Confirmed 19 7 2014-10-08
Oct 8 Nigeria Confirmed 19 7 2014-10-05
Oct 3 Nigeria Confirmed 19 7 2014-10-01
Oct 1 Nigeria Confirmed 19 7 2014-09-28
Sep 26 Nigeria Confirmed 19 7 2014-09-23
Sep 24 Nigeria Confirmed 19 7 2014-09-21
Sep 22 Nigeria Confirmed 19 7 2014-09-19
Sep 18 Nigeria Confirmed 19 7 2014-09-14
Sep 16 2014 Nigeria Confirmed 19 7 2014-09-13
Sep 12 2014 Nigeria Confirmed 19 7 2014-09-07
Sep 8 2014 Nigeria Confirmed 19 7 2014-09-06
Sep 5 2014 Nigeria Confirmed 18 7 2014-09-05
Aug 29 2014 Nigeria Confirmed 15 6 2014-08-25

filter() Example 2

to perform an or use the single |

filter( ebola.data
      , (Country == 'Nigeria') | (Country == 'Sierra Leone')
      , `Case def.` == 'Confirmed'
      )

an alternate form would be to use %in%

filter( ebola.data
      , Country %in% c('Nigeria', 'Sierra Leone')
      , `Case def.` == 'Confirmed'
      )

distinct() - normalizing

From the previous filter example note that report date is repeated week after week.
:::{.keyfunction} To get only distinct observations, use distinct(). :::

distinct() - normalizing

`Confirmed Cases for Sierra Leone` <- 
filter( ebola.data
      , Country == 'Sierra Leone'
      , `Case def.` == 'Confirmed'
      ) %>% 
    select(last_col(), Country, `Case def.`, starts_with('Total')) %>% 
    distinct()
Country Report Date Country Case def. Total cases Total deaths
2015-11-08 Sierra Leone Confirmed 8704 3589
2015-11-01 Sierra Leone Confirmed 8704 3589
2015-10-25 Sierra Leone Confirmed 8704 3589
2015-10-18 Sierra Leone Confirmed 8704 3589
2015-10-11 Sierra Leone Confirmed 8704 3589
2015-10-04 Sierra Leone Confirmed 8704 3589
2015-09-27 Sierra Leone Confirmed 8704 3589
2015-09-20 Sierra Leone Confirmed 8704 3589
2015-09-13 Sierra Leone Confirmed 8704 3587
2015-09-06 Sierra Leone Confirmed 8699 3587
2015-08-30 Sierra Leone Confirmed 8698 3587
2015-08-23 Sierra Leone Confirmed 8697 3586
2015-08-16 Sierra Leone Confirmed 8697 3586
2015-08-09 Sierra Leone Confirmed 8697 3585
2015-08-02 Sierra Leone Confirmed 8695 3585
2015-07-26 Sierra Leone Confirmed 8694 3585
2015-07-19 Sierra Leone Confirmed 8692 3583
2015-07-12 Sierra Leone Confirmed 8688 3581
2015-07-05 Sierra Leone Confirmed 8674 3574
2015-06-28 Sierra Leone Confirmed 8665 3566
2015-06-21 Sierra Leone Confirmed 8657 3562
2015-06-14 Sierra Leone Confirmed 8649 3553
2015-06-07 Sierra Leone Confirmed 8635 3549
2015-05-31 Sierra Leone Confirmed 8620 3546
2015-05-24 Sierra Leone Confirmed 8608 3542
2015-05-17 Sierra Leone Confirmed 8605 3541
2015-05-10 Sierra Leone Confirmed 8597 3538
2015-05-03 Sierra Leone Confirmed 8595 3537
2015-04-26 Sierra Leone Confirmed 8586 3533
2015-04-19 Sierra Leone Confirmed 8575 3511
2015-04-12 Sierra Leone Confirmed 8563 3491
2015-04-05 Sierra Leone Confirmed 8554 3465
2015-03-29 Sierra Leone Confirmed 8545 3433
2015-03-22 Sierra Leone Confirmed 8520 3381
2015-03-15 Sierra Leone Confirmed 8487 3325
2015-03-08 Sierra Leone Confirmed 8428 3263
2015-03-01 Sierra Leone Confirmed 8370 3180
2015-02-22 Sierra Leone Confirmed 8289 3095
2015-02-15 Sierra Leone Confirmed 8212 3042
2015-02-08 Sierra Leone Confirmed 8135 2975
2015-02-01 Sierra Leone Confirmed 8059 2910
2015-01-25 Sierra Leone Confirmed 7968 2833
2015-01-18 Sierra Leone Confirmed 7903 2779
2015-01-11 Sierra Leone Confirmed 7786 2696
2015-01-04 Sierra Leone Confirmed 7602 2577
2015-01-03 Sierra Leone Confirmed 7570 2549
2014-12-31 Sierra Leone Confirmed 7476 2461
2014-12-28 Sierra Leone Confirmed 7354 2392
2014-12-27 Sierra Leone Confirmed 7326 2366
2014-12-24 Sierra Leone Confirmed 7160 2289
2014-12-21 Sierra Leone Confirmed 7017 2216
2014-12-20 Sierra Leone Confirmed 6975 2190
2014-12-17 Sierra Leone Confirmed 6856 2111
2014-12-14 Sierra Leone Confirmed 6702 1876
2014-12-13 Sierra Leone Confirmed 6638 1824
2014-12-07 Sierra Leone Confirmed 6375 1559
2014-11-30 Sierra Leone Confirmed 5978 1374
2014-11-28 Sierra Leone Confirmed 5831 1321
2014-11-23 Sierra Leone Confirmed 5441 1189
2014-11-18 Sierra Leone Confirmed 5152 1058
2014-11-16 Sierra Leone Confirmed 5056 1041
2014-11-11 Sierra Leone Confirmed 4683 978
2014-11-09 Sierra Leone Confirmed 4523 960
2014-11-04 Sierra Leone Confirmed 4149 921
2014-11-02 Sierra Leone Confirmed 4057 893
2014-10-29 Sierra Leone Confirmed 3778 0
2014-10-27 Sierra Leone Confirmed 3700 0
2014-10-22 Sierra Leone Confirmed 3389 1008
2014-10-19 Sierra Leone Confirmed 3223 986
2014-10-14 Sierra Leone Confirmed 2977 932
2014-10-12 Sierra Leone Confirmed 2849 926
2014-10-08 Sierra Leone Confirmed 2593 753
2014-10-05 Sierra Leone Confirmed 2455 725
2014-10-01 Sierra Leone Confirmed 2179 575
2014-09-28 Sierra Leone Confirmed 2076 574
2014-09-23 Sierra Leone Confirmed 1816 557
2014-09-21 Sierra Leone Confirmed 1745 552
2014-09-19 Sierra Leone Confirmed 1640 545
2014-09-14 Sierra Leone Confirmed 1513 517
2014-09-13 Sierra Leone Confirmed 1464 514
2014-09-07 Sierra Leone Confirmed 1287 478
2014-09-06 Sierra Leone Confirmed 1234 461
2014-09-05 Sierra Leone Confirmed 1146 443
2014-08-25 Sierra Leone Confirmed 935 380

Sorting Data

To sort data use arrange()

sort() is the base version but again, less robust.

Arrange allows you to give sorting criteria.

arrange() Example

`Confirmed Cases for Sierra Leone` %>% 
    arrange(`Country Report Date`, desc(`Case def.`)) %>% 
    head()
Country Report Date Country Case def. Total cases Total deaths
2014-08-25 Sierra Leone Confirmed 935 380
2014-09-05 Sierra Leone Confirmed 1146 443
2014-09-06 Sierra Leone Confirmed 1234 461
2014-09-07 Sierra Leone Confirmed 1287 478
2014-09-13 Sierra Leone Confirmed 1464 514
2014-09-14 Sierra Leone Confirmed 1513 517

Reformatting data

  • Wide Data
    • multiple observations for one unit are in columns
  • Long Data
    • multiple observations for one unit are in rows.

Task: make ‘Case def.’ separate columns

We would like to make Case def. separate columns there are however 2 possible response variables:

  1. Total Cases
  2. Total Deaths

Options?

  1. Subset to each value of Case Def. Then merge those together.
  2. Choose our variable of interest and discard the rest, then pivot or spread the column.
  3. Spread each column and then ‘bind’ the results together.

Option 1 - filter then merge

# Make subsets
confirmed <- ebola.data %>% filter(`Case def.` == 'Confirmed') %>% 
    select(-`Case def.`) %>% distinct() %>% 
    rename_at(vars(starts_with("total")), ~paste("Confirmed", .))
probable  <- ebola.data %>% filter(`Case def.` == 'Probable') %>% 
    select(-`Case def.`) %>% distinct() %>%  
    rename_at(vars(starts_with("total")), ~paste("Probable", .))
suspected <- ebola.data %>% filter(`Case def.` == 'Suspected') %>% 
    select(-`Case def.`) %>%  distinct() %>% 
    rename_at(vars(starts_with("total")), ~paste("Suspected", .))
all.cases <- ebola.data %>% filter(`Case def.` == 'All') %>% 
    select(-`Case def.`) %>%  distinct() %>% 
    rename_at(vars(starts_with("total")), ~paste("All", .))

Option 1 - filter then merge

# Join together
ebola.option1 <- 
confirmed %>% 
    full_join(probable) %>% 
    full_join(suspected) %>% 
    full_join(all.cases)
## Joining, by = c("SheetName", "Country", "Country Report Date")
## Joining, by = c("SheetName", "Country", "Country Report Date")
## Joining, by = c("SheetName", "Country", "Country Report Date")
glimpse(ebola.option1)
## Rows: 838
## Columns: 11
## $ SheetName                <chr> "Jan 06, 2016", "Jan 0...
## $ Country                  <chr> "Guinea", "Liberia", "...
## $ `Confirmed Total cases`  <int> 3351, 3151, 9, 8704, 1...
## $ `Confirmed Total deaths` <int> 2083, 0, 3, 3589, 0, 0...
## $ `Country Report Date`    <dttm> 2015-12-27, 2015-05-0...
## $ `Probable Total cases`   <int> 453, 1879, 0, 287, 0, ...
## $ `Probable Total deaths`  <int> 453, 0, 0, 208, 0, 0, ...
## $ `Suspected Total cases`  <int> 0, 5636, 0, 5131, 0, 0...
## $ `Suspected Total deaths` <int> 0, 0, 0, 158, 0, 0, 0,...
## $ `All Total cases`        <int> 3804, 10666, 9, 14122,...
## $ `All Total deaths`       <int> 2536, 4806, 3, 3955, 0...

Combining data

Use the join family of functions to merge data together:

  • inner_join(a, b) - keep only rows that match both a and b.
  • left_join(a, b) - keep all rows of a and add columns in b to the rows that match. Unmatched rows will contain missing values.
  • right_join(a, b) - same as left but swap a and b.
  • full_join(a, b) - keep all rows of both a and b.
  • semi_join(a, b) - keep all rows of a that match b, but don’t add columns from b.
  • anti_join(a, b) - keep only those rows of a that don’t match b.

Operations have these parameters:

  • by - variables to join on, defaults to common variables
  • suffix - sufixes to add to distinguish common variables that are not part of by

Option 2 - pick 1 & spread

ebola.option2 <- 
    ebola.data %>% 
    select(SheetName, Country, `Case def.`, `Total cases`, `Country Report Date`) %>% 
    tidyr::spread('Case def.', 'Total cases')
glimpse(ebola.option2)
## Rows: 838
## Columns: 7
## $ SheetName             <chr> "Apr 01", "Apr 01", "Apr ...
## $ Country               <chr> "Guinea", "Liberia", "Mal...
## $ `Country Report Date` <dttm> 2015-03-29, 2015-03-29, ...
## $ All                   <int> 3492, 9712, 8, 20, 1, 119...
## $ Confirmed             <int> 3068, 3151, 7, 19, 1, 854...
## $ Probable              <int> 414, 1879, 1, 1, 0, 287, ...
## $ Suspected             <int> 10, 4682, 0, 0, 0, 3142, ...

Option 3 - Spread each and merge

# spread each
cases <- ebola.data %>% select(-`Total deaths`) %>% 
    tidyr::spread('Case def.', 'Total cases')
deaths <-  ebola.data %>% select(-`Total cases`) %>% 
    tidyr::spread('Case def.', 'Total deaths')
ebola.option3 <- 
    full_join( cases, deaths
             , c('SheetName', 'Country', 'Country Report Date')
             , suffix = c(".cases", ".deaths")) 
glimpse(ebola.option3)
## Rows: 838
## Columns: 11
## $ SheetName             <chr> "Apr 01", "Apr 01", "Apr ...
## $ Country               <chr> "Guinea", "Liberia", "Mal...
## $ `Country Report Date` <dttm> 2015-03-29, 2015-03-29, ...
## $ All.cases             <int> 3492, 9712, 8, 20, 1, 119...
## $ Confirmed.cases       <int> 3068, 3151, 7, 19, 1, 854...
## $ Probable.cases        <int> 414, 1879, 1, 1, 0, 287, ...
## $ Suspected.cases       <int> 10, 4682, 0, 0, 0, 3142, ...
## $ All.deaths            <int> 2314, 4332, 6, 8, 0, 3799...
## $ Confirmed.deaths      <int> 1900, 0, 0, 7, 0, 3433, 0...
## $ Probable.deaths       <int> 414, 0, 0, 1, 0, 208, 0, ...
## $ Suspected.deaths      <int> 0, 0, 0, 0, 0, 158, 0, 0,...

Summarization

Summarization

summarise(data, ...)

Take the data and summarise it by performing the ... operations to it.

summarize( ebola.option3
         , 'Observations' = n()
         , 'Number of countries' = n_distinct(Country)
         , "# of Reporting dates" = n_distinct(`Country Report Date`)
         , max.cases = max(All.cases, na.rm=TRUE)
         , max.deaths = max(All.deaths, na.rm=TRUE)
         )
Observations Number of countries # of Reporting dates max.cases max.deaths
838 11 120 14122 4806

Grouped Summarization

group_by(data, ...)

Take the data and group it by variables specified in ...,
all subsequent operations should be done by group.

ebola.option3 %>% group_by(Country) %>% 
  summarise( "# of Reporting dates" = n_distinct(`Country Report Date`)
           , max.cases = max(All.cases, na.rm=TRUE)
           , max.deaths = max(All.deaths, na.rm=TRUE)
           )

Grouped Summarization

Country # of Reporting dates max.cases max.deaths
Guinea 91 3810 2536
Italy 26 1 0
Liberia 53 10666 4806
Liberia2 27 9 3
Mali 68 8 6
Nigeria 16 22 8
Senegal 15 3 0
Sierra Leone 84 14122 3955
Spain 16 1 0
United Kingdom 47 1 0
United States of America 74 4 1

Question

Remember the wide data problem?

Time to go back

Exercise

1. Decide on the variable of interest, our value
1.5 Summarise to reduce the data to one row per country x reporting date
2. spread out the number of cases by date.

5:00

Question

What should we do with our data?

This data set on it’s own is not very interesting.

Let’s build something interesting.

World Bank Data

The wbstats package provides access to the world bank database.

library(wbstats)
wbsearch('population', extra=TRUE)
wb(indicator ="SP.POP.TOTL", startdate = 2014, enddate=2014)

Run these commands investigate the output and let’s discuss.

Recoding Country

To merge the the world bank data to our ebola data we need a common country variable.

library(countrycode)
long.ebola <- mutate( ebola.option3
                    , iso3c = countrycode(Country, "country.name", "iso3c"))

Allways, allways, allways, check your results.

assert_that(!any(is.na(long.ebola$iso3c)))
## [1] TRUE

Get the desired population data

  • SP.URB.TOTL.ZS - Percentage of Population in Urban Areas (in % of Total Population)
  • SP.POP.TOTL.MA.ZS - Population, male (% of total)
  • SP.POP.TOTL - Population, total
  • EN.POP.DNST - Population density (people per sq km)
  • IN.POV.HCR.EST.TOTL - Poverty HCR Estimates (%) - Total
  • NY.GDP.PCAP.CD - GDP per capita (current US$)
pop.vars <- c( 'SP.URB.TOTL.ZS', 'SP.POP.TOTL.MA.ZS'
             , 'SP.POP.TOTL', 'EN.POP.DNST'
             , 'IN.POV.HCR.EST.TOTL', 'NY.GDP.PCAP.CD')
pop.data <- wb( country = unique(long.ebola$iso3c)
              , indicator = pop.vars
              , startdate = 2014, enddate=2014)

Look at the data

  1. What format is it in?

  2. Are there any problems?

  3. Did we get get everything we expected?

Reshape and join together

meta.pop.data <- select(pop.data, variable=indicatorID, label=indicator) %>% distinct()
our.data <- 
    pop.data %>% 
    select(iso3c, value, indicatorID) %>% 
    spread(indicatorID, value) %>% 
    right_join(long.ebola)
## Joining, by = "iso3c"

Exercise/break

Create a table 1

  • Restrict the data to the most recent only.
  • Columns should be Africa, Other, and Total.
  • Rows should be summaries of the variables we have.
    • minimum, median, mean, maximum …

15:00